Abstract

Web searching accounts for one of the most frequently performed computations over the
Internet as well as one of the most important applications of outsourced computing, producing
results that critically affect users' decision-making behaviors. As such, verifying the integrity of 
Internet-based searches over vast amounts of web contents is essential. 

In this paper, we provide the first solution to this general security problem. We introduce the 
concept of an authenticated web crawler and present the design and prototype implementation 
of this new concept. An authenticated web crawler is a trusted program that computes a special 
"signature" S over a collection of web contents it visits. Subject to this signature, web searches 
can be verified to be correct with respect to the integrity of their produced results. But this 
signature serves more advanced purposes than just content verification: It allows the verification 
of complicated queries on web pages, such as conjunctive keyword searches, which are vital for 
the functionality of online web-search engines today. In our solution, along with the web pages 
that satisfy any given search query, the search engine also returns a cryptographic proof. This 
proof, together with the signature S, enables any user to efficiently verify that no legitimate 
web pages are omitted from the result computed by the search engine, and that no pages that 
are non-conforming with the query are included in the result. An important property of our 
solution is that the proof size and the verification time are proportional only to the sizes of the 
query description and the query result, but do not depend on the number or sizes of the web 
pages over which the search is performed. 

Our authentication protocols are based on standard Merkle trees and the more involved 
bilinear-map accumulators. As we experimentally demonstrate, the prototype implementation
of our system gives a low communication overhead between the search engine and the user, and 
allows for fast verification of the returned results on the user side. 

1 Introduction 

It goes without saying that web searching is an essential part of modern life. When we perform a web 
search, we expect that the list of links returned will be relevant and complete. As we heavily rely 
on web searching, an often overlooked issue is that search engines are outsourced computations. 
That is, users issue queries and have no intrinsic way of trusting the results they receive, thus 
introducing a modern spin on Cartesian doubt. This philosophy once asked if we can trust our
senses; now it should ask if we can trust our search results. Some possible attack scenarios that
arise in this context include the following:

1. A news web site posts a misleading article and later changes it to look as if the error never 
occurred. 

2. A company posts a back-dated white paper claiming an invention after a related patent is 
issued to a competitor. 

3. An obscure scientific web site posts incriminating data about a polluter, who then sues to get 
the data removed, in spite of its accuracy. 

4. A search engine censors content for queries coming from users in a certain country, even 
though an associated web crawler provided web pages that would otherwise be indexed for 
the forbidden queries. 

An Internet archive, such as the Wayback Machine, that digitally signs the archived web pages
could be a solution to detecting the first attack, but it does not address the rest. It misses detecting
the second, for instance, since there is no easy way in such a system to prove that something did
not exist in the past. Likewise, it does not address the third, since Internet archives tend to be
limited to popular web sites. Finally, it does not address the fourth, because such users would
likely also be blocked from the archive web site and, even otherwise, would have no good way of
detecting that pages are missing from a keyword-search response.

From a security point of view, we can abstract these problems in a model where a query request
(e.g., web-search terms) coming from an end user, Alice, is served by a remote, unknown and
possibly untrusted server (e.g., online search engine), Bob, who returns a result consumed by Alice 
(e.g., list of related web pages containing the query terms). In this context, it is important that 
such computational results are verifiable by the user, Alice, with respect to their integrity. Integrity 
verifiability, here, means that Alice receives additional authentication information (e.g., a digital 
signature from someone she trusts) that allows her to verify the integrity of the returned result. 
In addition to file-level protection, ensuring that data items (e.g., web contents) remain intact, the 
integrity of the returned results typically refers to the following three properties (e.g., [12]): (1)
correctness, ensuring that any returned result satisfies the query specification; (2) completeness,
ensuring that no result satisfying the query specification is omitted from the result; and (3) freshness,
ensuring that the returned result is computed on the currently valid, and most recently updated 
data. 

The ranking of query results is generally an important part of web searching. However, in the 
above scenarios, correctness, completeness, and freshness are more important to the user than a 
proof that the ranking of the results is accurate. Also, note that it is usually in the best interest of 
the search engine to rank pages correctly, e.g., for advertising. Thus in our context, we are interested 
in studying the problem of integrity verification of web content. In particular, we wish to design 
protocols and data-management mechanisms that can provide the user with a cryptographically 
verifiable proof that web content of interest or query results on this content are authentic, satisfying 
all the above three security properties: correctness, completeness, and freshness. 

1.1 Challenges in Verifying Web Searching 

Over the last decade, significant progress has been made on integrity protection for management of 
outsourced databases. Here, a database that is owned by a (trusted) source, Charles, is outsourced 
to an (untrusted) server, Bob, who serves queries coming from end users such as Alice. Using
an authenticated index structure, Bob adds authentication to the supported database responses,
that is, augments query results with authentication information, or proof, such that these results 
can be cryptographically verified by Alice with respect to their integrity. This authentication is 
done subject to a digest that the source, Charles, has produced by processing the currently valid 
database version. In the literature, many elegant solutions have been proposed focusing on the 
authentication of large classes of queries via short proofs (sent to the user) yielding fast verification 
times. 

Unfortunately, applying such existing methods to the authenticated web searching problem
is not straightforward, but rather entails several challenges. First, result verifiability is not well
defined for web searching, because unlike the database outsourcing model, in web searching there 
is no clear data source and there is no real data outsourcing by someone like Charles. Of course, 
one could consider a model where each web-page owner communicates with the online search 
engine to deliver (current) authentication information about the web pages this owner controls, 
but clearly this consideration would be unrealistic. Therefore, we need an authentication scheme 
that is consistent with the crawling-based current practices of web searching. 

Additionally, verifying the results of search engines seems particularly challenging given the large 
scale of this data-processing problem and the high rates at which data evolves over time. Indeed, 
even when we consider the static version of the problem where some portion of the web is used for 
archival purposes only (e.g., older online articles of the Wall Street Journal), authenticating general 
search queries for such rich sets of data seems almost impossible: How, for instance, is it possible 
to verify the completeness of a simple keyword-search query over a large collection of archival web 
pages? Existing authentication methods heavily rely on a total ordering of the database records 
on some (primary-key) attribute in order to provide "near-miss" proofs about the completeness of 
the returned records, but no such total order exists when considering keyword-search queries over 
text documents. This suggests that to prove completeness to a user the search engine will have to 
provide the user with "all supporting evidence" — all web contents the engine searched through — 
and let the user recompute and thus trivially verify the returned result. Clearly, this approach is 
also not scalable. Generally, Internet searching is a complicated "big data" problem and so is its 
authentication: We thus need an authentication scheme that produces proofs and incurs verification 
times that are web-search sensitive, that is, they depend on the set of returned results and not on 
the entire universe of possible documents. 
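
To make the scalability issue concrete, the following minimal Python sketch shows the naive check
described above, in which the user receives every document and recomputes the answer. The function
name naive_verify, the whitespace tokenization, and the toy collection are assumptions made for
illustration only; the point is that the work grows with the whole collection and not with the result.

```python
def naive_verify(collection, keywords, claimed_result):
    """Recompute a conjunctive keyword query over the entire collection.

    collection: dict mapping document id -> text (hypothetical toy format).
    Returns (correct, complete) for the claimed set of document ids.
    """
    wanted = {k.lower() for k in keywords}
    # The user must scan every document: cost grows with the collection size.
    matching = {
        doc_id
        for doc_id, text in collection.items()
        if wanted <= set(text.lower().split())
    }
    claimed = set(claimed_result)
    correct = claimed <= matching    # no non-conforming page was included
    complete = matching <= claimed   # no conforming page was omitted
    return correct, complete


collection = {
    "d1": "merkle trees authenticate data",
    "d2": "web crawler visits pages",
    "d3": "authenticated web crawler signs data",
}
print(naive_verify(collection, ["web", "crawler"], {"d2", "d3"}))
print(naive_verify(collection, ["web", "crawler"], {"d2"}))
```

The second call models a server that silently drops a matching page: the result is still correct
but no longer complete, and detecting this required reading all documents.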

Integrity protection becomes even more complicated when we consider web pages that frequently
change over time. For instance, collaborative systems store large amounts of scientific data
at distributed web-based repositories for the purpose of sharing knowledge and data-analysis results.
Similarly, users periodically update their personal or professional web pages and blogs with
information that can be searched through web search engines. In such dynamic settings, how is it 
possible to formally define the notion of freshness? Overall, we need an authentication scheme that 
is consistent with the highly dynamic nature of web content. 

1.2 Prior Related Work 

Despite its importance and unlike the well-studied problem of database authentication, the problem 
of web searching authentication has not been studied before in its entirety. To the best of our 
knowledge, the only existing prior work studies a more restricted version of the problem. 

The first solution on the authentication of web searches was recently proposed by Pang and
Mouratidis in PVLDB 2008 [21]. This work focuses on the specific, but very important and
representative case, where search engines perform similarity-based document retrieval. Pang and
Mouratidis show how to construct an authentication index structure that can be used by an untrusted
server, the search engine, in the outsourced database model to authenticate text-search queries over
documents that are based on similarity searching. In their model, a trusted owner of a collection 
of documents outsources this collection to the search engine, along with this authentication index 
structure defined over the document collection. Then, whenever a user issues a text-search query 
to the engine, by specifying a set of keywords, the engine returns the top r results (r being a 
system parameter) according to some well-defined notion of relevance that relates query keywords 
to documents in the collection. In particular, given a specific term (keyword) an inverted list is 
formed with documents that contain the term; in this list the documents are ordered according to 
their estimated relevance to the given term. Using the authentication index structure, the engine is 
able to return those r documents in the collection that are better related to the terms that appear 
in the query, along with a proof that allows the user to verify the correctness of this document 
retrieval. 

At a technical level, Pang and Mouratidis make use of an elegant hash-based authentication
method: The main idea is to apply cryptographic hashing in the form of a Merkle hash tree
(MHT) [14] over each of the term-defined lists. Note that each such list imposes a total ordering
over the documents it contains according to their assigned score, therefore completeness proofs are
possible. Pang and Mouratidis observe that the engine only partially parses the lists related to
the query terms: At some position of a list, the corresponding document has a score low enough
that it cannot be included in the top r results. Therefore, it suffices to provide hash-based proofs
(i.e., consisting of a set of hash values, similar to the proof in a Merkle tree) only for prefixes of 
the lists related to the query terms. Pang and Mouratidis thus construct a novel chained sequence 
of Merkle trees. This chain is used to authenticate the documents in an inverted list corresponding 
to a term, introducing a better trade-off between verification time and size of provided proof. 

To answer a query, the following process is used. For each term, documents are retrieved 
sequentially through its document list, like a sliding window, and the scores of these documents 
are aggregated. A document's score is determined by the frequency of each query term in the 
document. This parsing stops when it is certain that no document will have a score higher than 
the current aggregated score. 

To authenticate a query, the engine collects, as part of the answer and its associated proof, a 
verification object that contains the r top-ranked documents and their corresponding MHT proofs, 
as well as all the documents in between that did not score higher but were potential candidates. 
For example, consider the following two lists: 

term1 : doc1, doc5, doc3, doc2
term2 : doc2, doc5, doc3, doc4, doc1

If a query is "term1 term2", r is 1 and doc1 has the highest score, then the verification object has
to contain doc2, doc5, doc3, doc4 to prove that their scores are lower than that of doc1.

We thus observe four limitations in the work by Pang and Mouratidis: (1) their scheme is 
not fully consistent with the crawling-based web searching, since it operates in the outsourced 
database model; (2) their scheme is not web-search sensitive because, as we demonstrated above, 
it returns documents and associated proofs that are not related to the actual query answer, and 
these additional returned items may be of size proportional to the number of documents in the 
collection; (3) their scheme requires complete reconstruction of the authentication index structure 
when the document collection is updated: even a simple document update may introduce changes 
in the underlying document scores, which in the worst case will completely destroy the ordering 
within one or more inverted lists; (4) it is not clear whether and how their scheme can support 
query types different from disjunctive keyword searches.

1.3 Our Approach

Inspired by the work by Pang and Mouratidis [21], we propose a new model for performing
keyword-search queries over online web contents. In Section 2, we introduce the concept of an authenticated
web crawler, an authentication module that achieves authentication of general queries for web 
searching in a way that is consistent with the current crawling-based search engines. Indeed, this 
new concept is a program that like any other web crawler visits web pages according to their 
link structure. But while the program visits the web pages, it incrementally builds a space-efficient 
authenticated data structure that is specially designed to support general keyword searches over the 
contents of the web pages. Also, the authenticated web crawler serves as a trusted component that 
computes a signature, an accurate security snapshot, of the current contents of the web. When the 
crawling is complete, the crawler publishes the signature and gives the authenticated data structure 
to the search engine. The authenticated data structure is in turn used by the engine to provide 
verification proofs for any keyword-search query that is issued by a user, which can be verified by 
having access to the succinct, already published signature. 

In Section 3, we present our authentication methodology, which provides proofs that are
web-search sensitive, i.e., of size that depends linearly on the query parameters and the results returned
by the engine and logarithmically on the size of the inverted index (number of indexed terms), but
it does not depend on the size of the document collection (number of documents in the inverted
lists). The verification time spent by the user is also web-search sensitive: It depends only on the 
query (linearly), answer (linearly) and the size of the inverted index (logarithmically), not on the 
size of the web contents that are accessed by the crawler. We stress the fact that our methodology 
can support general keyword searches, in particular, conjunctive keyword searches, which are vital 
for the functionality of online web search engines today and, accordingly, a main focus in our work. 

Additionally, our authentication solution allows for efficient updates of the authenticated data 
structure. That is, if the web content is updated, the changes that must be performed in the 
authenticated data structure are only specific to the web content that changes and there is no 
need to recompute the entire authentication structure. This property is useful in more
than one way: Either the web crawler itself may incrementally update a previously computed
authentication structure the next time it crawls the web, or the crawler may perform such an
update on demand without performing a web crawl.

So far we have explained a three-party model where the client verifies that the results returned
by the search engine are consistent with the content found by the web crawler. (See also Section 2.1
for more details.) Our solution can also be used in other common scenarios involving two parties.
First, consider a client that outsources its data to the cloud (e.g., to Google Docs or Amazon Web
Services) and uses the cloud service to perform keyword-search queries on this data. Using our
framework, the client can verify that the search results were correctly computed on the outsourced
data. In another scenario, we have a client doing keyword searches on a trusted search engine that
delivers the search results via a content delivery network (CDN). Here, our solution can be used
by the client to discover if the results from the search engine were tampered with during their
delivery, e.g., because of a compromised server in the CDN. (See Section 2.2 for more details on
these two-party models.)

Our authentication protocols are based on standard Merkle trees as well as on bilinear-map 
accumulators. The latter are cryptographic primitives that can be viewed as being equivalent 
to the former, i.e., they provide efficient proofs of set membership. But they achieve a different 
trade-off between verification time and update time: Verification time is constant (not logarithmic), 
at the cost that update time is no longer logarithmic (but still sublinear). However, bilinear-map
accumulators offer a unique property not offered by Merkle trees: They can provide constant-size
proofs (and corresponding verification times) for set disjointness, i.e., a constant-size proof
suffices to show that two sets are disjoint. This property is very important for proving
completeness over the unordered documents that are stored in an inverted index (used to support
keyword searches).

In Section 4, we describe implementation details of the prototype of our solution and present
an empirical evaluation of its performance on the Wall Street Journal archive. We demonstrate that
in practice our solution gives a low communication overhead between the search engine and the
user, and allows for fast verification of the returned results on the user side. We also show that our
prototype can efficiently support updates to the collection. We conclude in Section 5.

1.4 Additional Related Work 

Before we describe the details of our approach, we briefly discuss some additional related work. 

A large body of work exists on authenticated data structures (e.g., [3, 16, 23]), which provide a
framework for designing practical methods for query authentication: Answers to queries on a data 
structure can be verified efficiently through the computation of some short cryptographic proofs. 

Research initially focused on authenticating membership queries [1] and the design of various
authenticated dictionaries [9, 16, 25] based on extensions of Merkle's hash tree [14]. Subsequently,
one-way accumulators [SJ [18] were employed to design dynamic authenticated dictionaries [5, 22]
that are based on algebraic cryptographic properties to provide optimal verification overheads.

More general queries have been studied as well. Extensions of hash trees have been used to
authenticate various types of queries, including basic operations (e.g., select, join) on databases [7],
pattern matching in tries [13] and orthogonal range searching [2, 13], path queries and connectivity
queries on graphs and queries on geometric objects [11] and queries on XML documents [3, 6]. Many
of these queries can be reduced to one-dimensional range-search queries, which can be verified
optimally in [10, 19] by combining collision-resistant hashing and one-way accumulators. Recently,
more involved cryptographic primitives have been used for optimally verifying set operations [23].

Substantial progress has also been made on the design of generic authentication techniques. 
In [13], it is shown how to authenticate in the static case a rich class of search queries in DAGs (e.g.,
orthogonal range searching) by hashing over the search structure of the underlying data structure. 
In [11], it is shown how extensions of hash trees can be used to authenticate decomposable properties
of data organized as paths (e.g., aggregation queries over sequences of objects) or any search queries 
that involve iterative searches over catalogs (e.g., point location). Both works involve proof sizes and 
verification times that asymptotically equal the complexity of answering queries. Recently, in [26], a 
new framework is introduced for authenticating general query types over structured data organized 
and accessed in the relational database model. By decoupling the processes of query answering and 
answer verification, it is shown how any query can be reduced (without loss of efficiency) to the 
fundamental problem of set-membership authentication, and that super-efficient answer verification 
(where the verification of an answer is asymptotically faster than the answer computation) for many 
interesting problems is possible. Set-membership authentication via collision-resistant hashing is 
studied in [25], where it is shown that for hash-based authenticated dictionaries of size n, all costs
related to authentication are at least logarithmic in n in the worst case. 

Finally, a growing body of work studies the specific problem of authenticating SQL queries over
outsourced relational databases, typically in external-memory data management settings.
Representatives of such work include the authentication of range search queries by using hash trees
(e.g., [7, 11]) and by combining hash trees with accumulators (e.g., [10, 19]) or B-trees with
signatures (e.g., [17, 20]). Additional work includes an efficient hash-based B-tree-based
authenticated indexing technique in [12], the authentication of join queries in [27] and of shortest path

Figure 1: The three-party model as the main operational setting of an authenticated web crawler.

2 Our model

We first describe two models where our results can be applied, allowing users to verify
searches over collections of documents of interest (e.g., a set of web pages).

2.1 Three-party Model 

We refer to the three-party model of Figure 1. (We note that although related, this model is
considerably different from the "traditional" three-party model that has been studied in the field
of authenticated data structures [24].) In our model the three parties are called crawler, server,
and client. The client trusts the crawler but not the server. The protocol executed by the parties
consists of three phases: a preprocessing phase, a query phase, and an update phase.

In the preprocessing phase, the crawler accesses the collection of documents and takes a snapshot
of them. The crawler then produces and digitally signs a secure, collision-resistant digest of
this snapshot. This digest contains information about both the identifiers of the documents in the
snapshot and their contents. The signed digest is made public so that clients can use it for
verification during the query phase. Concurrently, the crawler builds an authenticated data structure
supporting keyword searches on the documents in the snapshot. Finally, the crawler outsources to 
the server the snapshot of the collection along with the authenticated data structure. 
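
As a rough illustration of a digest that binds document identifiers to document contents, the
following Python sketch hashes a toy snapshot with SHA-256. It is only a sketch under simplifying
assumptions: it uses a flat hash rather than the Merkle-tree and accumulator-based structure
developed in Section 3, the function name snapshot_digest and the example URLs are hypothetical,
and the signing step is omitted because the Python standard library offers no public-key signatures.

```python
import hashlib


def snapshot_digest(snapshot):
    """Hash document identifiers together with their contents.

    snapshot: dict mapping a document identifier (e.g., a URL) -> content bytes.
    """
    outer = hashlib.sha256()
    for doc_id in sorted(snapshot):  # a fixed order makes the digest deterministic
        id_hash = hashlib.sha256(doc_id.encode("utf-8")).digest()
        content_hash = hashlib.sha256(snapshot[doc_id]).digest()
        # Fixed-length inner hashes avoid ambiguity when fields are concatenated.
        outer.update(id_hash + content_hash)
    return outer.hexdigest()


snapshot = {
    "http://example.org/a": b"first page",
    "http://example.org/b": b"second page",
}
digest = snapshot_digest(snapshot)
tampered = {**snapshot, "http://example.org/b": b"second page (edited)"}
print(digest != snapshot_digest(tampered))
```

In this simplified setting, a client holding the published digest could detect any change to an
identifier or to a page's contents, but only by rehashing the whole snapshot; avoiding that cost is
what the authenticated data structure is for.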

In the query phase, the client sends a query request, consisting of keywords, to the server.
The server computes the query results, i.e., the set of documents in the snapshot that contain the
query keywords. Next, the server returns to the client an answer consisting of the computed query
results and a cryptographic proof that the query results are correct, complete, and fresh¹, and that
their associated digests are valid. The client verifies the correctness and the completeness of the
query results using the proof provided by the server and relying on the trust in the snapshot digest
that was signed and published by the crawler. The above interaction can be repeated for another query
issued by the client.

Figure 2: The first two-party model: Client outsources a document collection and uses server to
store it and perform search queries on it.

In the update phase, the crawler parses the documents to be added or removed from the collection
and sends to the server any new contents and changes for the authenticated data structure.
Additionally, it computes and publishes a signed digest of the new snapshot of the collection. 

Note that the query phase is executed after the preprocessing phase. Thus, the system gives 
assurance to the client that the documents have not been changed since the last snapshot was taken 
by the crawler. Also, during the update phase, the newly signed digest prevents the server from 
tampering with the changes it receives from the crawler. 



2.2 Two-party Models 

Our solution can also be used in a two-party model in two scenarios. In Figure 2, we consider a
cloud-based model where a client outsources its data along with an authenticated search structure
to the cloud-based server but keeps a digest of the snapshot of its data. In this scenario, the server
executes keyword-search queries issued by the client and constructs a proof of the result. The client
can then verify the results using the digest kept before outsourcing. The model shown in Figure 3
protects the interaction between a client and a search engine from a man-in-the-middle attack. The
search engine publishes a signed digest of its current snapshot of the web and supplies every search 
result with a proof. Upon receiving search results in response to a query, the client can use the 
digest and the proof to verify that the results sent from the search engine were not tampered with. 

¹ We note that in the cryptographic literature on authenticated data structures (e.g., [22, 23]) these three integrity
properties are combined into a unified security property.

2.3 Desired Properties

The main properties we seek to achieve in our solution are security and efficiency. The system 
should be secure, i.e., it should be computationally infeasible for the server to construct a verifiable 
proof for an incorrect result (e.g., include a web page that does not contain a keyword of the query). 
Our system should also be practical, i.e., we should avoid a solution where the authenticated crawler 
literally signs every input participating in the specific computation and where the verification is 
performed by downloading and verifying all these inputs. For example, consider a search for 
documents which contain two specific terms. Each term could appear in many documents while 
only a small subset of them may contain both. In this case, we would not want a user to know 
about all the documents where these terms appear in order to verify that the small intersection she 
received as a query answer is correct. Finally, we want our solution to support efficient updates: 
Due to their high frequency, updates on web contents should incur overhead that is proportional 
to the size of the updated content and not of the entire collection. 

We finally envision two modes of operation for our authentication framework. First, the
authenticated web crawler operates autonomously; here, it serves as a "web police officer" and creates
a global snapshot of a web site. This mode allows the integrity checking to be used as a
verification that certain access control or compliance policies are met (e.g., a page owner does not post
confidential information outside its organizational domain). Alternatively, the authenticated web
crawler operates in coordination with the web page owners (authors); here, each owner interacts
with the crawler to commit to the current local snapshot of its web page. This mode allows the
integrity proofs to be transferable to third parties in case of a dispute (e.g., no one can accuse a
page owner of deliberately posting offending materials).

3 Our solution 

In this section we describe the cryptographic and algorithmic methods used by all the separate
entities in our model for the verification of conjunctive keyword searches. As we will see,
conjunctive keyword searches are equivalent to a set intersection on the underlying inverted index data
structure [29]. Our solution is based on the authenticated data structure of [23] for verifying set
operations on outsourced sets. We now provide an overview of the construction in [23]. First we
introduce the necessary cryptographic primitives.
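
As a minimal illustration of this equivalence (a sketch under simplifying assumptions, not the
authenticated structure of [23]), the following Python code builds a toy inverted index and answers
a conjunctive query by intersecting the document sets of the query terms. The function names, the
whitespace tokenization, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def conjunctive_search(index, keywords):
    """Return the ids of documents containing every keyword (set intersection)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    postings.sort(key=len)        # starting from the smallest set keeps work low
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


documents = {
    "d1": "merkle trees authenticate data",
    "d2": "web crawler visits pages",
    "d3": "authenticated web crawler signs data",
}
index = build_inverted_index(documents)
print(sorted(conjunctive_search(index, ["web", "crawler"])))
print(sorted(conjunctive_search(index, ["data", "crawler"])))
```

Each term may appear in many documents while the intersection is small, which is why a proof that
depends only on the query and the answer, and not on the full sets, is desirable.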

3.1 Cryptographic Background 

Our construction makes use of the cryptographic primitives of bilinear pairings and Merkle hash 
trees. 

Bilinear pairings. Let $\mathbb{G}$ be a cyclic multiplicative group of prime order $p$, generated by $g$. Let
also $\mathcal{G}$ be a cyclic multiplicative group with the same order $p$ and
$e : \mathbb{G} \times \mathbb{G} \to \mathcal{G}$ be a bilinear pairing with the following properties:
(1) Bilinearity: $e(P^a, Q^b) = e(P, Q)^{ab}$ for all $P, Q \in \mathbb{G}$ and $a, b \in \mathbb{Z}_p$;
(2) Non-degeneracy: $e(g, g) \neq 1$; (3) Computability: There is an efficient algorithm to
compute $e(P, Q)$ for all $P, Q \in \mathbb{G}$. We denote with $(p, \mathbb{G}, \mathcal{G}, e, g)$ the bilinear pairings parameters,
output by a polynomial-time algorithm on input $1^k$.
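
As a brief illustration of bilinearity (a standard consequence of property (1), stated here for
intuition and not taken from the construction itself), setting $P = Q = g$ shows that the pairing is
symmetric on powers of the generator and that exponents can be moved between its two arguments:

```latex
\begin{align*}
e(g^a, g^b) &= e(g, g)^{ab} = e(g, g)^{ba} = e(g^b, g^a), \\
e(g^a, g^b) &= e(g, g)^{ab} = e(g^{ab}, g)
\end{align*}
```

for all $a, b \in \mathbb{Z}_p$. Identities of this kind are what allow a verifier to check a
multiplicative relation between exponents while seeing only group elements.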

Merkle hash trees. A Merkle hash tree [14] is an authenticated data structure [23] allowing an
untrusted server to vouch for the integrity of a dynamic set of indexed data $T[0], T[1], \ldots, T[m-1]$
that is stored at untrusted repositories. Its representation comprises a binary tree with hashes in
the internal nodes, denoted with merkle(T). A Merkle hash tree is equipped with the following
algorithms:

1. {merkle(T), sig(T)} = setup(T). This algorithm outputs a succinct signature of the table T
which can be used for verification purposes and the hash tree merkle(T). To construct the
signature and the Merkle tree, a collision-resistant hash function hash(·) is applied recursively
over the nodes of a binary tree on top of T. Leaf $\ell \in \{0, 1, \ldots, m-1\}$ of merkle(T) is assigned
the value $h_\ell = \mathrm{hash}(\ell \| T[\ell])$, while each internal node $v$ with children $a$ and $b$ is assigned the
value $h_v = \mathrm{hash}(h_a \| h_b)$. The root of the tree $h_r$ is signed to produce signature sig(T).

2. {proof(i), answer(i)} = query(i, T, merkle(T)). Given an index $0 \le i \le m-1$, this algorithm
outputs a proof that could be used to prove that answer(i) is the value stored at T[i]. Let
path(i) be a list of nodes that denotes the path from leaf i to the root and sibl(v) denote the
sibling of node v in merkle(T). Then, proof(i) is the ordered list containing the hashes of the
siblings sibl(v) of the nodes v in path(i).

3. {0, 1} = verify(proof(i), answer(i), sig(T)). This algorithm is used for verification of the answer
answer(i). It computes the root value of merkle(T) using answer(i) and proof(i), i.e., the sibling
nodes of nodes in path(i), by performing a chain of hash computations over nodes in path(i).
It then checks to see if the output is equal to $h_r$, a value signed with sig(T), in which case it
outputs 1, implying that answer(i) = T[i] with high probability.

For more details on Merkle hash trees, please refer to |14j . 

Example: Consider a Merkle hash tree for four data items in Figure 4. A collision resistant 
hash function hash is used recursively to compute a hash value for each node of the tree. On 
query(2, T, merkle(T)) the query algorithm returns 

answer(2) = T[2], 
proof(2) = $\{h_3, h_{01}\}$. 

The verification algorithm then computes the values $h_2' = \mathrm{hash}(2 \,\|\, T[2])$ and 
$h_{23}' = \mathrm{hash}(h_2' \,\|\, h_3)$, then checks that 

$h_r = \mathrm{hash}(h_{01} \,\|\, h_{23}')$. 

Figure 4: Example of a Merkle hash tree over the data items T[0], T[1], T[2], T[3], with leaves 
$h_i = \mathrm{hash}(i \,\|\, T[i])$ and root $h_r = \mathrm{hash}(h_{01} \,\|\, h_{23})$. 



We now describe in detail the individual functionality of the web crawler, the untrusted server 
and the client. 

3.2 Web Crawler 

The web page collection we are interested in verifying is described with an inverted index data 
structure. Suppose there are m terms $q_0, q_1, \ldots, q_{m-1}$ over which we are indexing, each one 
mapped to a set of web pages $S_i$, such that each web page in $S_i$ contains the term $q_i$, for 
$i = 0, 1, \ldots, m-1$. Assume, without loss of generality, that each web page in $S_i$ can be 
represented with an integer in $\mathbb{Z}_p^*$, where $p$ is a large $k$-bit prime. For example, 
this integer can be a cryptographic hash of the entire web page. However, if we want to cover only a 
subset of the data in the page, the hash could be applied to cover text and outgoing links, but not 
the HTML structure. The extraction of the relevant data will be made by a special filter and will 
depend on the application. For instance, if we are interested in tables of NYSE financial data, 
there is no need to include images in the hash. 
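The mapping from pages to integers can be sketched as follows. This is a minimal illustration, not the prototype's filter: the prime (2^255 - 19, chosen only because it is a well-known prime), the use of SHA-256, and the tokenizer standing in for the application-specific filter are all assumptions.

```python
import hashlib
import re

# Stand-in for the k-bit prime p; 2**255 - 19 is used purely for illustration.
P = 2**255 - 19


def page_id(text: str, p: int = P) -> int:
    """Map page content to an integer in Z_p^* via a cryptographic hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Reduce into the range [1, p - 1] so the result is never zero.
    return int.from_bytes(digest, "big") % (p - 1) + 1


def build_inverted_index(pages):
    """Return {term: set of page integers}; the regex tokenizer is hypothetical."""
    index = {}
    for text in pages:
        pid = page_id(text)
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(term, set()).add(pid)
    return index
```

A filter that hashes only the text and outgoing links would replace the argument of page_id with the filtered content; the rest of the construction is unchanged.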

The authenticated data structure is built very simply as follows: First the authenticated crawler 
picks a random value $s \in \mathbb{Z}_p^*$ which is kept secret. Then, for each set $S_i$ 
($i = 0, 1, \ldots, m-1$), the accumulation value, 

$$T[i] = g^{\prod_{x \in S_i} (s + x)},$$

is computed, where $g$ is a generator of the group $G$ from an instance of bilinear pairing 
parameters. Then the crawler calls algorithm setup(T) to compute signature sig(T) and Merkle hash 
tree merkle(T); the former is passed to the clients (to support the result verification) and the 
latter is outsourced to the server (to support the proof computation). 

Intuition. There are two integrity-protection levels: First, the Merkle hash tree protects the 
integrity of the accumulation values T[i], offering coarse-grained verification of the integrity of 
the sets. Second, the accumulation value T[i] of $S_i$ protects the integrity of the web pages in 
set $S_i$, offering fine-grained verification of the sets. In particular, each accumulation value 
T[i] maintains an algebraic structure within the set $S_i$ that is useful in two ways [22, 23]: 
(1) subject to an authentic accumulation value T[i], (subset) membership can be proved using a 
succinct witness, yielding a proof of "correctness," and (2) disjointness between t such sets can be 
proved using t succinct witnesses, yielding a proof of "completeness." Bilinearity is crucial for 
testing both properties. 

Handling updates. During an update to inverted list $S_i$ of term $q_i$, the crawler needs to change 
the corresponding accumulation value T[i] and update the path of the Merkle tree from the leaf 
corresponding to term $q_i$ to the root. A page $x'$ is added to an inverted list $S_i$ of term $q_i$ 
if either a new page $x'$ contains term $q_i$ or the content of a page $x'$ that is already in the 
corpus is updated and now contains $q_i$. In this case the accumulation value of $q_i$ is changed to 
$T'[i] = T[i]^{(s + x')}$. If some web page $x' \in S_i$ is removed or no longer contains term $q_i$, 
the accumulation value is changed to $T'[i] = T[i]^{(s + x')^{-1}}$. It is straightforward to handle 
updates in the Merkle hash tree (in time logarithmic in the number of inverted lists; see also 
Section 4). 
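The accumulation value and its update rules can be sketched with ordinary modular arithmetic. The group here is a deliberately tiny, insecure stand-in: the order-1019 subgroup of the integers modulo the safe prime 2039, generated by 4. It is not a pairing group, and the constants and function names are assumptions of this example. The sketch requires Python 3.8 or later for modular inverses via pow.

```python
# Toy parameters (assumptions): P_MOD = 2 * ORDER + 1 is a safe prime and
# G = 4 generates the subgroup of prime order ORDER. Far too small to be secure.
P_MOD = 2039
ORDER = 1019
G = 4


def accumulate(pages, s):
    """T[i] = g^(prod over x in S_i of (s + x)), exponent reduced mod ORDER."""
    exponent = 1
    for x in pages:
        exponent = exponent * (s + x) % ORDER
    return pow(G, exponent, P_MOD)


def add_page(acc, x_new, s):
    """Insertion: T'[i] = T[i]^(s + x')."""
    return pow(acc, (s + x_new) % ORDER, P_MOD)


def remove_page(acc, x_old, s):
    """Deletion: T'[i] = T[i]^((s + x')^-1), inverse taken mod the group order."""
    inverse = pow((s + x_old) % ORDER, -1, ORDER)
    return pow(acc, inverse, P_MOD)
```

Each update costs one exponentiation regardless of the length of the inverted list, which is why only the crawler, who knows s, can apply it this cheaply.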

3.3 Untrusted Server 

The untrusted server in our solution stores the inverted index along with the authentication 
information defined earlier as authenticated data structure merkle(T). Given a conjunctive 
keyword-search query $q = (q_1, q_2, \ldots, q_t)$ from a client, the server returns a set of web 
pages $\mathcal{I}$ where each web page $p \in \mathcal{I}$ contains all terms from $q$. Namely, it 
is the case that 

$$\mathcal{I} = S_1 \cap S_2 \cap \ldots \cap S_t.$$

The server is now going to compute a proof so that a client can verify that all web pages included 
in $\mathcal{I}$ contain $q_1, q_2, \ldots, q_t$ and ensure that no web page from the collection 
that satisfies query $q$ is omitted from $\mathcal{I}$. Namely the server needs to prove that 
$\mathcal{I}$ is the correct intersection $S_1 \cap S_2 \cap \ldots \cap S_t$. 

One way to compute such a proof would be to have the server just send all the elements in 
$S_1, S_2, \ldots, S_t$ along with Merkle tree proofs for T[1], T[2], ..., T[t]. The contents of these sets could 
be verified and the client could then compute the intersection locally. Subsequently the client could 
check to see if the returned intersection is correct or not. The drawback of this solution is that it 
involves linear communication and verification complexity, which could be prohibitive, especially 
when the sets are large. 

To address this problem, in CRYPTO 2011, Papamanthou, Tamassia and Triandopoulos [23] 
observed that it suffices to certify succinct relations related to the correctness of the intersection. 
These relations have size independent of the sizes of the sets involved in the computation of the 
intersection, yielding a very efficient protocol for checking the correctness of the intersection. 
Namely $\mathcal{I} = S_1 \cap S_2 \cap \ldots \cap S_t$ is the correct intersection if and only if: 

1. $\mathcal{I} \subseteq S_1 \wedge \ldots \wedge \mathcal{I} \subseteq S_t$ (subset condition); 

2. $(S_1 - \mathcal{I}) \cap \ldots \cap (S_t - \mathcal{I}) = \emptyset$ (completeness condition). 
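The equivalence between the two conditions and the correctness of the intersection can be checked directly on plain sets. The sketch below is only a sanity check of that set-theoretic statement, with a hypothetical function name; it involves no cryptography.

```python
def is_correct_intersection(candidate, sets):
    """Check the subset and completeness conditions for a candidate answer."""
    # Subset condition: the candidate is contained in every S_j.
    subset_ok = all(candidate <= s for s in sets)
    # Completeness condition: the leftovers S_j - candidate share no element.
    leftovers = [s - candidate for s in sets]
    complete_ok = not set.intersection(*leftovers)
    return subset_ok and complete_ok
```

An answer that omits a common element fails the completeness condition, and an answer that includes a page missing from some S_j fails the subset condition.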

Accordingly, for every intersection $\mathcal{I} = \{y_1, y_2, \ldots, y_\delta\}$ the server 
constructs the proof that consists of four parts: 

A Coefficients $b_\delta, b_{\delta-1}, \ldots, b_0$ of the polynomial 
$(s + y_1)(s + y_2) \cdots (s + y_\delta)$ associated with the intersection $\mathcal{I}$; 

B Accumulation values T[j] associated with the sets $S_j$, along with their respective proofs 
proof(j), output from calling algorithm query(j, T, merkle(T)), for $j = 1, \ldots, t$; 

C Subset witnesses $W_{\mathcal{I},j} = g^{P_j(s)}$, for $j = 1, \ldots, t$, where 

$$P_j(s) = \prod_{x \in S_j - \mathcal{I}} (s + x);$$

D Completeness witnesses $F_{\mathcal{I},j} = g^{q_j(s)}$, for $j = 1, \ldots, t$, such that 

$$q_1(s) P_1(s) + q_2(s) P_2(s) + \ldots + q_t(s) P_t(s) = 1,$$

where $P_j(s)$ are the exponents of the subset witnesses. 

Intuition. Part A comprises an encoding of the result (as a polynomial) that allows efficient 
verification of the two conditions. Part B comprises the proofs for the first-level integrity 
protection based on Merkle hash trees. Part C comprises a subset-membership proof for the 
second-level integrity protection based on the bilinear accumulators. Part D comprises a 
set-disjointness proof for the second-level integrity protection based on the extended Euclidean 
algorithm for finding the interrelation of irreducible polynomials. 
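Parts A, C and D reduce to polynomial arithmetic over $\mathbb{Z}_p$. The following sketch computes coefficients from roots and finds the polynomials $q_j$ with the extended Euclidean algorithm. It uses schoolbook multiplication and a toy prime, whereas the prototype relies on FFT and number-theory libraries; the helper names and the modulus 1019 are assumptions. Polynomials are lists of coefficients, lowest degree first.

```python
P = 1019  # toy prime modulus (assumption)


def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a


def add(a, b, sign=1):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + sign * y) % P for x, y in zip(a, b)])


def mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return trim(out)


def divmod_poly(a, b):
    """Quotient and remainder of a by a non-zero polynomial b."""
    a, q = a[:], [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, P)
    while len(a) >= len(b):
        c, shift = a[-1] * inv % P, len(a) - len(b)
        q[shift] = c
        for i, y in enumerate(b):
            a[shift + i] = (a[shift + i] - c * y) % P
        trim(a)
    return trim(q), a


def from_roots(values):
    """Coefficients of prod (s + y) over y in values."""
    poly = [1]
    for y in values:
        poly = mul(poly, [y % P, 1])
    return poly


def xgcd(a, b):
    """Return (g, u, v) with u*a + v*b = g, a gcd of a and b."""
    r0, r1, u0, u1, v0, v1 = a[:], b[:], [1], [], [], [1]
    while r1:
        q, r = divmod_poly(r0, r1)
        r0, r1 = r1, r
        u0, u1 = u1, add(u0, mul(q, u1), sign=-1)
        v0, v1 = v1, add(v0, mul(q, v1), sign=-1)
    return r0, u0, v0


def bezout(polys):
    """Return q_1..q_t with sum q_j * P_j = 1, or raise if a factor is shared."""
    g, coeffs = polys[0], [[1]]
    for p in polys[1:]:
        g, u, v = xgcd(g, p)
        coeffs = [mul(u, c) for c in coeffs] + [v]
    if len(g) != 1:
        raise ValueError("polynomials share a common factor")
    scale = [pow(g[0], -1, P)]
    return [mul(scale, c) for c in coeffs]
```

The ValueError branch corresponds to an incomplete answer: if an element common to all sets is left out of the result, the polynomials $P_j$ share a factor and no valid $q_j$ exist.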

3.4 Client 

The client verifies the intersection $\mathcal{I}$ by appropriately verifying the corresponding 
proof elements described above: 

A It first certifies that coefficients $b_\delta, \ldots, b_0$ are computed correctly by the server, 
i.e., that they correspond to the polynomial $\prod_{x \in \mathcal{I}} (s + x)$, by checking that 
$\sum_{i=0}^{\delta} b_i \kappa^i$ equals $\prod_{x \in \mathcal{I}} (\kappa + x)$ for a randomly 
chosen value $\kappa \in \mathbb{Z}_p^*$; 

B It then verifies T[j] for each term $q_j$ that belongs to the query ($j = 1, \ldots, t$), by using 
algorithm verify(proof(j), T[j], sig(T)); 

C It then checks the subset condition 

$$e\Big(\prod_{k=0}^{\delta} \big(g^{s^k}\big)^{b_k},\; W_{\mathcal{I},j}\Big) = e(T[j], g) \quad \text{for } j = 1, \ldots, t;$$

D Finally, it checks that the completeness condition holds 

$$\prod_{j=1}^{t} e\big(W_{\mathcal{I},j},\; F_{\mathcal{I},j}\big) = e(g, g). \qquad (1)$$

The client accepts the intersection as correct if and only if all the above checks succeed. 

Intuition. Step A corresponds to an efficient, high-assurance probabilistic check of the consistency 
of the result's encoding. Steps C and D verify the correctness of the subset and set-disjointness 
proofs based on the bilinearity of the underlying group and cryptographic hardness assumptions 
that are related to discrete-log problems. 
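Step A can be sketched in a few lines. Two distinct polynomials of degree at most $\delta$ agree on at most $\delta$ points, so a wrong coefficient list passes the check at a random point with probability at most $\delta / (p - 1)$. The function name and the way the random point is sampled are assumptions of this example.

```python
import secrets


def check_coefficients(coeffs, intersection, p):
    """Probabilistic check that coeffs (lowest degree first) encode
    prod (s + x) over x in intersection, by evaluating at a random point."""
    kappa = secrets.randbelow(p - 1) + 1  # uniform in [1, p - 1]
    lhs = 0
    for b in reversed(coeffs):  # Horner evaluation of sum b_i * kappa^i
        lhs = (lhs * kappa + b) % p
    rhs = 1
    for x in intersection:
        rhs = rhs * (kappa + x) % p
    return lhs == rhs
```

The check needs only the returned documents and coefficients, so its cost is linear in the intersection size and does not involve the secret key.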

3.5 Final Protocol 

We now summarize the protocol of our solution. 
Web crawler. Given a security parameter: 

1. Process a collection of webpages and create an inverted index. 

2. Generate a description of the group and bilinear pairing parameters $(p, G, \mathcal{G}, e, g)$. 

3. Pick randomly a secret key $s \in \mathbb{Z}_p^*$. 

4. Compute accumulation value T[i] for each term i in the inverted index. 

5. Build a Merkle hash tree merkle(T) and sign the root of the tree $h_r$ as sig(T). 

6. Compute values $g^{s^1}, \ldots, g^{s^n}$, where 
$n > \max\{m, \max_{i=0,\ldots,m-1}\{|S_i|\}\}$. 

7. Send inverted index and merkle(T) to the server. 

8. Publish sig(T), $(p, G, \mathcal{G}, e, g)$ and $g^{s^1}, g^{s^2}, \ldots, g^{s^n}$ so that the 
server can access them to compute the proof and clients can acquire them during verification. 

Untrusted server. Given a query $q = \{q_1, q_2, \ldots, q_t\}$: 

1. Compute the answer for $q$ as the intersection $\mathcal{I} = \{y_1, y_2, \ldots, y_\delta\}$ of 
inverted lists corresponding to $q$. 

2. Compute the coefficients $b_\delta, b_{\delta-1}, \ldots, b_0$ corresponding to the polynomial 
$(s + y_1)(s + y_2) \cdots (s + y_\delta)$. 

3. Use merkle(T) to compute the integrity proofs of T[j]. 

4. Compute subset witnesses $W_{\mathcal{I},j} = g^{P_j(s)}$. 

5. Compute completeness witnesses $F_{\mathcal{I},j} = g^{q_j(s)}$. 

6. Send $\mathcal{I}$ and all components of the proof to the client. 

Client. Send query $q$ to the server, and given an answer to the query and the proof, accept the 
answer as correct if all of the following hold: 

1. Coefficients of the intersection are computed correctly: Pick a random 
$\kappa \in \mathbb{Z}_p^*$ and verify that 
$\sum_{i=0}^{\delta} b_i \kappa^i = \prod_{x \in \mathcal{I}} (\kappa + x)$. (Note that the client 
can verify the coefficients without knowing the secret key $s$.) 

2. Accumulation values are correct: Verify integrity of these values using sig(T) and merkle(T). 

3. Subset and completeness conditions hold. 

Example: We now consider how our protocol works on a toy collection. Consider the inverted index 
in Table 1, where a term is mapped to the set of documents in which it appears, i.e., an inverted 
list. For example, term "mouse" appears in documents 2 and 5, and document 2 contains words "disk" 
and "mouse". 

Term ID   Term       Inverted list 
0         computer   6,8,9 
1         disk       1,2,4,5,6,7 
2         hard       1,3,5,7,8,9 
3         memory     1,4,7 
4         mouse      2,5 
5         port       3,5,9 
6         ram        5,6,7 
7         system     1,7 

Table 1: An example of an inverted index. 

The crawler computes accumulation values T[i] for each term id i, e.g., the accumulation value for 
term "memory" is 

$$T[3] = g^{(s+1)(s+4)(s+7)},$$

and builds a Merkle tree where each leaf corresponds to a term; see Figure 5. The Merkle tree and 
the inverted index are sent to the server. The crawler also computes 
$g^{s^1}, g^{s^2}, \ldots, g^{s^n}$ and publishes them along with a signed root of the Merkle tree, 
sig(T). 

Figure 5: Merkle tree for authenticating the accumulation values of the terms in Table 1. The 
leaves T[0], ..., T[7] correspond to the terms computer, disk, hard, memory, mouse, port, ram and 
system, with $h_0 = \mathrm{hash}(0 \,\|\, T[0])$, $h_{01} = \mathrm{hash}(h_0 \,\|\, h_1)$, 
$h_{03} = \mathrm{hash}(h_{01} \,\|\, h_{23})$ and root $h_r = \mathrm{hash}(h_{03} \,\|\, h_{47})$. 

Given a query q = (hard AND disk AND memory), the result is the intersection $\mathcal{I}$ of the 
inverted lists for each of the terms in the query. In our case, $\mathcal{I} = \{1, 7\}$. The server 
builds a proof and sends it to the client along with the intersection. The proof consists of the 
following parts: 

A Intersection proof: Coefficients $b_0 = 7$, $b_1 = 8$ and $b_2 = 1$ of the intersection polynomial 
$(s + 1)(s + 7)$. 

B Accumulation values proof: values T[1], T[2] and T[3] (for the terms disk, hard and memory) with 
a proof from the Merkle tree that these values are correct: 

T[1]   $\{h_0, h_{23}, h_{47}\}$ 
T[2]   $\{h_3, h_{01}, h_{47}\}$ 
T[3]   $\{h_2, h_{01}, h_{47}\}$ 

C Subset witnesses $g^{P_1(s)}$, $g^{P_2(s)}$ and $g^{P_3(s)}$ for each term in the query, where 

$P_1(s) = (s + 2)(s + 4)(s + 5)(s + 6)$, 
$P_2(s) = (s + 3)(s + 5)(s + 8)(s + 9)$, 
$P_3(s) = (s + 4)$. 

D Completeness witnesses: Using the Extended Euclidean algorithm the server finds values 
$g^{q_1(s)}$, $g^{q_2(s)}$ and $g^{q_3(s)}$ such that 

$q_1(s) P_1(s) + q_2(s) P_2(s) + q_3(s) P_3(s) = 1$. 

Note that since the server knows the values $g^{s^i}$, it can compute the values $g^{P_j(s)}$ and 
$g^{q_j(s)}$ without knowing the private key $s$ (only the coefficients of the polynomials are 
required). 
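The remark about computing witnesses without the secret key can be made concrete: given the published powers $g^{s^i}$ and the coefficients $c_i$ of a polynomial, the value $g^{P(s)}$ equals $\prod_i (g^{s^i})^{c_i}$. The sketch below uses the same kind of toy, insecure group as a stand-in for the pairing group (order-1019 subgroup modulo 2039, generator 4); all names and constants are assumptions.

```python
# Toy group (assumption): G has prime order ORDER in the integers mod P_MOD.
P_MOD, ORDER, G = 2039, 1019, 4


def publish_powers(s, n):
    """Crawler side: g^(s^0), g^(s^1), ..., g^(s^n). Requires the secret s."""
    return [pow(G, pow(s, i, ORDER), P_MOD) for i in range(n + 1)]


def commit(coeffs, powers):
    """Server side: g^(P(s)) from coefficients (lowest degree first),
    using only the published powers and never the secret s."""
    if len(coeffs) > len(powers):
        raise ValueError("polynomial degree exceeds the published powers")
    value = 1
    for c, g_power in zip(coeffs, powers):
        value = value * pow(g_power, c % ORDER, P_MOD) % P_MOD
    return value
```

For the subset witness of "disk" in the example, the coefficients of $(s+2)(s+4)(s+5)(s+6) = s^4 + 17s^3 + 104s^2 + 268s + 240$ are passed as [240, 268, 104, 17, 1]. This is also why the crawler must publish at least as many powers as the longest inverted list.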
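To verify the response from the server, the client: 

A Picks a random $\kappa \in \mathbb{Z}_p^*$ and checks that 
$\sum_{i=0}^{2} b_i \kappa^i = \prod_{x \in \mathcal{I}} (\kappa + x)$, e.g., 
$7 + 8\kappa + \kappa^2 = (\kappa + 1)(\kappa + 7)$. 

B Verifies that each T[j] is correct, e.g., for T[1] and its proof $\{h_0, h_{23}, h_{47}\}$ the 
client checks that 

$h_r = \mathrm{hash}(\mathrm{hash}(\mathrm{hash}(h_0 \,\|\, \mathrm{hash}(1 \,\|\, T[1])) \,\|\, h_{23}) \,\|\, h_{47})$, 

where sig(T) is the signature on the root $h_r$ of the Merkle tree. 

C Checks the subset condition: 

$$e\Big(\prod_{k=0}^{2} \big(g^{s^k}\big)^{b_k},\; g^{P_j(s)}\Big) = e(T[j], g) \quad \text{for } j = 1, 2, 3.$$

D Checks completeness: 

$$\prod_{j=1}^{3} e\big(g^{P_j(s)},\; g^{q_j(s)}\big) = e(g, g).$$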



3.6 Security 

With respect to security, we show that given an intersection query (i.e., a keyword search) 
referring to keywords $q_1, q_2, \ldots, q_t$, no computationally-bounded adversary can construct an 
incorrect answer $\mathcal{I}$ and a proof $\pi$ that passes the verification test of Section 3.4, 
except with negligible probability. The proof is by contradiction. Suppose the adversary picks a set 
of indices $q = \{1, 2, \ldots, t\}$ (wlog), all between 1 and $m$, and outputs a proof $\pi$ and an 
incorrect answer $\mathcal{I} \neq I = S_1 \cap S_2 \cap \ldots \cap S_t$. Suppose the answer 
$\mathcal{I}$ contains $d$ elements. The proof $\pi$ contains (i) some coefficients 
$b_0, b_1, \ldots, b_d$; (ii) some accumulation values $\mathrm{acc}_j$ with some respective proofs 
$\Pi_j$, for $j = 1, \ldots, t$; (iii) some subset witnesses $W_j$ with some completeness witnesses 
$F_j$, for $j = 1, \ldots, t$ (inputs to the verification algorithm). 

Suppose the verification test on these values is successful. Then: (a) By the certification 
procedure, $b_0, b_1, \ldots, b_d$ are indeed the coefficients of the polynomial 
$\prod_{x \in \mathcal{I}} (s + x)$, except with negligible probability; (b) By the properties of the 
Merkle tree, values $\mathrm{acc}_j$ are indeed the accumulation values of sets $S_j$, except with 
negligible probability; (c) By the successful checking of the subset condition, values $W_j$ are 
indeed the subset witnesses for set $\mathcal{I}$ (with reference to $S_j$), i.e., 
$W_j = g^{P_j(s)}$, except with negligible probability; (d) However, since $\mathcal{I}$ is 
incorrect, it cannot include all the elements of $I$, and there must be at least one element $a$ 
that is not in $\mathcal{I}$ and for which $(s + a)$ is a common factor of the polynomials 
$P_1(s), P_2(s), \ldots, P_t(s)$. In this case, the adversary can divide the polynomials 
$P_1(s), P_2(s), \ldots, P_t(s)$ by $s + a$ in the completeness relation of Equation (1) and derive 
the quantity $e(g, g)^{1/(s+a)}$ at the right hand side. However, this implies that the adversary 
has solved in polynomial time a difficult problem in the target group $\mathcal{G}$; in particular, 
the adversary has broken the bilinear q-strong Diffie-Hellman assumption, which happens with 
negligible probability. More details on the security proof can be found in [23]. 
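Step (d) rests on a simple algebraic fact that can be checked numerically: if an element $a$ common to all sets is omitted from the answer, every $P_j$ vanishes at $-a$, so no combination $\sum_j q_j P_j$ can equal the constant polynomial 1. The sketch below only illustrates this fact on the toy example; it is not the security reduction, which must also rule out relations that hold only at the secret point $s$. The modulus and function names are assumptions.

```python
P = 1019  # toy prime modulus (assumption)


def residual_at(point, inverted_list, answer):
    """Evaluate P_j(z) = prod over x in S_j - answer of (z + x) at z = point."""
    value = 1
    for x in inverted_list - answer:
        value = value * (point + x) % P
    return value


def common_root(lists, answer):
    """Return an omitted element a such that every P_j vanishes at -a, or None."""
    omitted = set.intersection(*[s - answer for s in lists])
    for a in sorted(omitted):
        if all(residual_at(-a % P, s, answer) == 0 for s in lists):
            return a
    return None
```

For the query (hard AND disk AND memory), the incomplete answer {1} yields the common root $a = 7$, while the correct answer {1, 7} yields none.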

4 Performance 

In this section, we describe a prototype implementation of our authenticated crawler system and 
discuss the results of experimentation regarding its practical performance. 

4.1 Performance Measures 

We are interested in studying four performance measures: 

1. The size of the proof of a query result, which is sent by the server to the client. 
Asymptotically, the proof size is $O(\delta + t \log m)$ when $\delta$ documents are returned while 
searching for $t$ terms out of the $m$ total distinct terms that are indexed in the collection. This 
parameter affects the bandwidth usage of the system. 

2. The computational effort at the server for constructing the proof. Let $N$ be the total size 
of the inverted lists of the query terms. The asymptotic running time at the server is 
$O(N \log^2 N \log\log N)$ [23]. Note that the overhead over the time needed to compute a plain 
set intersection, which is $O(N)$, is only a polylogarithmic multiplicative factor. In practice, the 
critical computation at the server is the extended Euclidean algorithm, which is executed to 
construct the completeness witnesses. 

3. The computational effort at the client to verify the proof. The asymptotic running time at 
the client is $O(\delta + t \log m)$ for $t$ query terms, $\delta$ documents in the query result and 
an inverted index of $m$ distinct terms. 

4. The computational effort at the crawler to update the authenticated data structure when 
some documents are added to or deleted from the collection. The asymptotic running time 
at the crawler consists of updating accumulation values and corresponding Merkle tree paths 
for $t'$ unique terms that appear in $n'$ updated documents, and, hence, it is 
$O(t'n' + t' \log m)$. 

4.2 Prototype Implementation 

Our prototype is built in C++ and is split between three parties: authenticated crawler, search 
engine and client. The interaction between the three proceeds as follows. 

The crawler picks a secret key $s$ and processes a collection of documents. After creating the 
inverted index, the crawler computes an accumulation value for each inverted list. We use the 
cryptographic pairing from [15] for all bilinear pairing computations, which is available in the 
DCLXVI library (http://www.cryptojedi.org/crypto/). This library implements an optimal ate pairing 
on a Barreto-Naehrig curve over a prime field $\mathbb{F}_p$ of size 256 bits, which provides a 
128-bit security level. Hence, the accumulation value of the documents containing a given term is 
represented as a point on a curve. Once all accumulation values are computed, the crawler builds a 
Merkle tree where each leaf corresponds to a term and its accumulation value. We use SHA256 from 
the OpenSSL library (http://www.openssl.org/) to compute a hash value at each node of the Merkle 
tree. The crawler also computes values $g, g^{s}, \ldots, g^{s^n}$. 

After the authenticated data structure is built, the crawler outsources the inverted index, 
authentication layer and precomputed values $g, g^{s}, \ldots, g^{s^n}$ to the server. 

Clients query the search engine via the RPC interface provided by the Apache Thrift software 
framework [1]. For each query, the server computes the proof consisting of four parts. To efficiently 
compute the proof, the server makes use of the following algorithms: 

• The coefficients $b_\delta, b_{\delta-1}, \ldots, b_0$ and the coefficients for the subset 
witnesses are computed using FFT. 

• The Extended Euclidean algorithm is used to compute the coefficients 
$q_1(s), q_2(s), \ldots, q_t(s)$ for the completeness witnesses. 

We use the NTL and LiDIA libraries for efficient arithmetic operations, FFT and the Euclidean 
algorithm on elements in $\mathbb{Z}_p^*$, which represent document identifiers. Bilinear pairing, 
power and multiplication operations on the group elements are performed with the methods provided 
in the bilinear map library. 

We performed the following optimizations on the server. The computations of the subset and 
completeness witnesses are independent of each other and hence are executed in parallel. We also 
noticed that the most expensive part of the server's computation is the power operation for group 
elements when computing subset and completeness witnesses. Since these computations are 
independent of each other, we run the computation of $g^{P_j(s)}$ and $g^{q_j(s)}$ in parallel. 

The client's verification algorithm consists of verifying the accumulation values using the Merkle 
tree and running the bilinear pairing procedure over the proof elements. 



4.3 Experimental Results 

We have conducted computational experiments on an 8-core Xeon 2.93 GHz processor with 8 GB RAM 
running 64-bit Debian Linux. In the discussion of the results, we use the following terminology. 

• Total set size: sum of the lengths of the inverted lists of the query terms. This value 
corresponds to variable $N$ used in the asymptotic running times given in Section 4.1. 

• Intersection size: number of documents returned as the query result. This value corresponds 
to variable $\delta$ used in the asymptotic running times given in Section 4.1. 



4.3.1 Synthetic Data 

We have created a synthetic data set where we can control the frequency of the terms as well as 
the size of the intersection for each query. The synthetic data set contains 777,348 documents and 
320 terms. Our first experiment identifies how the size of the intersection, $\delta$, influences 
the time it takes for the server to compute the proof given that the size of the inverted lists stays 
the same. We report the results in Figure 6(a), where each point corresponds to a query consisting 
of two terms. Each term used in a query appears in 2,000 documents. As the intersection size grows, 
the size of the polynomial $P_j(s)$ represented by the subset witness in Section 3.3 decreases. 
Hence the time it takes the server to compute the proof decreases as well. 

We now measure how the size of the inverted lists affects the server's time when the size of the 
resulting intersection is fixed to $\delta = 100$ documents. Figure 6(b) shows results for queries 
of two types. 

• The first type consists of queries where both terms appear in the collection with the same 
frequency. 

• The second type consists of queries that contain a frequent and a rare term. 

In each query we define a term as rare if its frequency is ten times lower than the frequency of 
the other term. As the number of terms for subset witnesses grows, the time to compute these 
witnesses grows as well. The dependency is linear, as expected (see Section 4.1). We also note that 
the computation is more expensive when both terms in the query have the same frequency in the 
collection. 



NTL: A Library for doing Number Theory, V5.5.2. 

LiDIA: A library for Computational Number Theory, V2.3. 

Figure 6: Computational effort at the server on a synthetic data set. (a) Time to compute the 
proof for a query with two terms as a function of the intersection size, i.e., the number of returned 
query results. The length of the inverted list for each term is fixed at 2,000 documents. (b) Time 
to compute the proof for two types of queries: two frequent terms and a frequent-rare term pair. The 
intersection size for each query, i.e., the number of returned documents, is 100. 



4.3.2 WSJ Corpus 

We have also tested our solution on a real data set that consists of 173,252 articles published in 
the Wall Street Journal from March 1987 to March 1992. After preprocessing, this collection has 
135,760 unique terms. We have removed common words, e.g., articles and prepositions, as well as 
words that appear only in one document. The distribution of lengths of the inverted lists is shown 
in Figure 7(a), where 56% of the terms appear in at most 10 documents. 



Our query set consists of 200 random conjunctive queries with two terms. We picked queries that 
yield varying result sizes, from empty to 762 documents. Since each term in a query corresponds 
to an inverted list of the documents it appears in, we also picked rare as well as frequent terms. 
Here, we considered a term frequent if it appeared in at least 8,000 documents. The total set size 
and corresponding intersection size for each query are shown in Figure 7(b). 

Server time. We first measure the time it takes for the server to compute the proof. In Figure 8(a) 
we show how the size of the inverted lists of the terms in the query influences the server's time. As 
expected, the dependency is linear in the total set size. However, some of the queries that have 
inverted lists of close length result in different times. This happens because the intersection size 
varies between the queries, as can be seen in Figure 7(b). Furthermore, some of the queries contain 
different types of terms, e.g., consider a query with one rare and one frequent word, and a query 
with two semi-frequent words (see Section 4.3.1). 

In Figure 8(b) we show how the intersection size influences the server's time. Note that the graph 
is almost identical to Figure 7(b), again showing that the time mostly depends on the lengths of 
the inverted lists and not the intersection size. 

Client time and proof size. We now measure the time it takes for the client to verify the proof. 
Following the complexity analysis of our solution, the computation on the client side is very fast. We 
split the computation time since verification of the proof consists of verifying that the 
intersection was performed on the correct inverted lists (Merkle tree) and that the intersection 
itself is computed correctly (bilinear pairing on accumulation values). In Figure 9 we plot the time 
it takes to verify both versus the intersection size: It depends only on the intersection size and 
not on the total set size (the lengths of the inverted lists of the query terms). Finally, the size 
of the proof sent to the client is proportional to the intersection size, as can be seen in 
Figure 10. 

Figure 7: (a) Distribution of the length of the inverted lists for the WSJ corpus. (b) Query set 
for the WSJ corpus. Each point in the graph corresponds to a query and its intersection size in the 
corpus, i.e., the number of documents that have all terms from the query. 

Figure 8: Computational effort at the server for queries with two terms on the WSJ corpus. Time 
to compute the proof as a function of (a) the total set size and (b) the intersection size. The total 
set size is the total length of the inverted lists of the query terms. 



Updates to the corpus. The simulation supports addition and deletion of documents and 
updates the corresponding authenticated data structures. We pick a set of 1500 documents from the 
original collection which covers over 14% of the collection vocabulary. In Figure 11 we measure the 
time it takes for the crawler to update accumulation values in the authenticated data structure. 
As expected, the time to do the update is linear in the number of unique terms that appear in 
the updated document set. Deletions and additions take almost the same time since the update is 
dominated by a single exponentiation of the accumulation value of each affected term. Updates to 
the Merkle tree take milliseconds. 

Figure 9: Time at the client to verify the proof as a function of the intersection size. The time is 
split between verifying the leaves of the Merkle tree and the integrity of the intersection using 
accumulation values. 

Figure 10: Size of the proof as a function of the intersection size. 

Figure 11: Time it takes for the crawler to update the authenticated data when documents are 
added or deleted from the original collection. 



4.3.3 Comparison with Previous Work 

The closest work to ours is the method developed by Pang and Mouratidis [21]. However, their 
method solves a different problem. Using our method, the user can verify that the result of 
the query contains all the documents from the collection that satisfy the query and that no 
malicious documents have been added. The method of [21] proves that the returned result consists of 
top-ranked documents for the given query. However, it does not assure the completeness of the query 
result with respect to the query terms. 

The authors of [21] show that their best technique achieves below 60 msec verification time (on a 
dual Intel Xeon 3GHz CPU with 512MB RAM machine) and less than 50 Kbytes in proof size for a result 
consisting of 80 documents. Using our method the verification time for a result of 80 documents 
takes under 17.5 msec (Figure 9) and the corresponding verification object is of size under 
7 Kbytes (Figure 10). The computation effort by the server reported in [21] is lower than it is for 
our method, one second versus two seconds. 

We also note that updates for the solution in [21] require changes to the whole construction, 
while updates to our authenticated data structures are linear in the number of unique terms that 
appear in new documents. 



4.3.4 Improvements and Extensions 

From our experimental results, we observe that the most expensive part of our solution is the 
computation of subset and completeness witnesses at the server. This is evident when a query 
involves frequent terms with long inverted lists, where each term requires a call to a multiplication 
and power operation of group elements in G. However, these operations are independent of each 
other and can be executed in parallel. Our implementation already runs several multiplication 
operations in parallel. However, the number of parallel operations is limited on our 8-core processor. 

In a practical deployment of our model, the server is in the cloud and has much more computa- 
tional power than the client, e.g., the server is a search engine and the client is a portable device. 
Hence, with a more powerful server, we can achieve faster proof computation for frequent terms. 

Our implementation could use a parallel implementation of the Extended Euclidean Algorithm. 
However, the NTL library is not thread-safe, and therefore we could not perform this optimization 
for our current prototype. 



5 Conclusion 

We study the problem of verifying the results of a keyword-search query returned by a search engine. 
We introduce the concept of an authenticated web-crawler which enables clients to verify that the 
query results returned by the search engine contain all and only the web pages satisfying the query. 
Our prototype implementation has low communication overhead and provides fast verification of 
the results. 

Our method verifies the correctness and completeness of the results but does not check the 
ranking of the results. An interesting extension to our method would be to efficiently verify also 
the ranking, i.e., return to the client r pages and a proof that the returned set consists of the top-r 
results. 